arXiv:1504.07124vl [q-bio.PE] 27 Apr 2015 


Twisted trees and inconsistency of tree estimation 


when gaps are treated as missing data - the impact 
of model mis-specihcation in distance corrections 

Emily Jane McTavish^’^’* Mike Steel^ Mark T. Holder^’^ 


April 28, 2015 


1. Department of Ecology and Evolutionary Biology, University of Kansas, 
Lawrence KS, USA 

2. Heidelberg Institute for Theoretical Studies, Heidelberg, Germany 

3. Biomathematics Research Centre, University of Canterbury, Christchurch, 
New Zealand 

* to whom correspondence should be addressed. Email: ejmctavish@gmail.com. 


Abstract 

Statistically consistent estimation of phylogenetic trees or gene 
trees is possible if pairwise sequence dissimilarities can be converted 
to a set of distances that are proportional to the true evolutionary 


distances. Susko et al. (2004) reported some strikingly broad results 
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about the forms of inconsistency in tree estimation that can arise if 
corrected distances are not proportional to the true distances. They 
showed that if the corrected distance is a concave function of the true 
distance, then inconsistency due to long branch attraction will occur. If 
these functions are convex, then two “long branch repulsion” trees will 
be preferred over the true tree - though these two incorrect trees are 
expected to be tied as the preferred true. Here we extend their results, 
and demonstrate the existence of a tree shape (which we refer to as a 
“twisted Farris-zone” tree) for which a single incorrect tree topology 
will be guaranteed to be preferred if the corrected distance function is 
convex. We also report that the standard practice of treating gaps in 
sequence alignments as missing data is sufficient to produce non-linear 
corrected distance functions if the substitution process is not indepen¬ 
dent of the insertion/deletion process. Taken together, these results 
imply inconsistent tree inference under mild conditions. For example, 
if some positions in a sequence are constrained to be free of substi¬ 
tutions and insertion/deletion events while the remaining sites evolve 
with independent substitutions and insertion/deletion events, then the 
distances obtained by treating gaps as missing data can support an 
incorrect tree topology even given an unlimited amount of data. 
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Introduction 


Distance-based methods are fast and statistically consistent estimators of 
tree topology if the input distances converge (with increasing data) to val¬ 
ues that are proportional to the evolutionary distance between tips. An 
evolutionary distance is the number of substitution events that have oc¬ 
curred along the path separating two tips. Typically a distance correction 
procedure is applied to the observed sequence differences to obtain a more 
accurate estimate of the evolutionary distance between pairs of sequences. 
However, in many cases it is not possible to correctly account for the evolu¬ 
tionary processes which generated the data. In other words, it is not always 
possible to accurately estimate the evolutionary distance for pairwise mea¬ 
surements of dissimilarity. 


In a pioneering paper, Susko et al. (2004) showed how model misspeci- 


fication can lead to transformed evolutionary distances that are non-linear 
with respect to evolutionary distance (i.e. concave or convex), and for which 
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there are regions of tree space for which neighbor joining will be inconsistent. 
We extend this result further (Theorem in Appendix A) by showing how 
virtually all misspecified correction functions lead to (strong) inconsistency 
(an incorrect tree will be unambiguously favoured by neighbor-joining). A 
main focus of this paper involves a particular study of model misspecihcation 
in distance corrections that treats gaps as missing data. 


Model 


For variants of the simplest model of sequence evolution (Jukes and Cantor 


1969), all nucleotides are equally exchangeable and the simple proportion 


of sites that differ, the p-distance, is a sufficient statistic for estimating 
an evolutionary distance. Under such a model. Mg, the expected p-distance 
between a pair of taxa is a function of the evolutionary distance (path length 
in the tree) t between the taxa, that is, we have IEg|p] = dit), where the 
function p is a monotonically (strictly) increasing function of t which is 
analytic (i.e. has a power series expansion, and so derivatives exist of all 
orders) and satisfies p(0) = 0. For example, for the Jukes-Cantor model we 
have g{t) = f If the distances are corrected under a (possibly 

different), fully exchangeable model, Mf, then the estimated evolutionary 
distance t is usually computed from the p-distance by simply using the ‘plug- 
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in’ formula t = f~^{p). 

Thus, for any generating model for which p converges in probability to¬ 
wards its expected value IEg[p] = g{t) (e.g. i.i.d. site substitution models) 
the estimated evolutionary distance t will converge towards t = h{t), where 
h{t) = Note here that both p and t are random variables, while 

t is simply a function of t. Notice that this ‘transformed’ evolutionary dis¬ 


tance t is not exactly the expected value of t, even when f = g (Tajima 


1993), since the expectation of a non-linear function of random variable is 


not generally equal to the function evaluated at the expected value of that 
variable. Nevertheless, for any i.i.d. site substitution model, the difference 
between i and the expected value of t decays towards zero as the sequence 
length grows. 

Notice also that when f = g (i.e. the correction model matches the 
generating model) then i = t. However, in general, t need not be equal to t, 
except when t = 0. For example, if the generating model is the Jukes-Cantor 
model with some form of among-site rate heterogeneity and the correcting 
model that does not assume the same form of rate heterogeneity then t can 


depend on t in a quite non-linear way (Soubrier et ah, 2012). 

In this paper we are interested in determining when the transformed 
evolutionary distances t will favour a different tree to the tree generating 
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the data. In particular, we explore an example of how ignoring the process 
of insertion and deletion (referred to jointly as indels hereafter) can lead to 
statistical inconsistency in an otherwise correctly modeled substitution pro¬ 
cess. Inconsistency occurs in this case even when the alignment of residues 
is correct. 


Susko et al. (2004) studied general properties of t as a function of t. If 


this function is linear (i.e. when the correction model matches the gener¬ 
ating model up to a constant factor) then distance-based tree estimation 
will be statistically consistent. If the function is concave, inference can be 
inconsistent and positively misleading due to long branch attraction. They 
also show that if the function is convex, two long branch repulsion trees are 
expected to be equally preferred over the correct tree. In Appendix A we 
establish a more general result: outside of the special case where the correct¬ 
ing generating model matches the generating model up to a constant factor, 
there will always exist tree shapes for which neighbor-joining will estimate a 
single incorrect tree from t. The tree shapes used to demonstrate this result 
are the familiar Felsenstein-zone tree (Fig. [H^.; [Felsenstein 1978) and a tree 
that we refer to as the “twisted Farris-zone” tree (Fig. m- 
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Figure 1: (A) The Felsenstein-zone tree with branch lengths used in the 
proof of Lemma (B) The ‘twisted’ Farris-zone tree used in the proof of 
Lemma IH 


The gaps as missing data convention 

It is common practice to treat a gap in a sequence as missing data in phy¬ 
logenetic estimation based on distances, parsimony scores or likelihoods. In 
the context of a pairwise distance calculation, this treatment means that 
positions with a gap in either sequence are disregarded because they can¬ 
not be counted as either a similarity or a difference. Omitting indels from 
distance corrections obvionsly forfeits the opportnnity for learning about 
the evolutionary distance from insertions and deletion events. However, one 
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may hope that treating sites with gaps as missing data would not perturb a 


distance estimate that relies solely on substitution events. If the substitution 
and indel processes are completely independent, this is the case. 

Consider the case of sequences that are generated by: a time-reversible 
stochastic process of insertions and deletion, and a model of substitutions 
for which there is a statistically consistent distance correction. If the align¬ 
ment is known without error, then the only effect of the indel process is 
to introduce a fraction of sites, z, for which one sequence lacks a residue 
and the other sequence has a residue. These are the gapped positions in a 
pairwise alignment. Note that the presence of gap in a column in the align¬ 
ment is not handled by deleting the column. The gap only affects pairwise 
comparisons involving a sequence which contains a gap. A full description 
for z for a full alignment would require some additional notation to indicate 
which sequences are being compared. Our argument below applies to any 
pairwise distance, so we simply use z{t) to describe the expected proportion 
of gapped position in any pairwise distance for sequences separated by path 
length, t. 

The fraction of gapped positions will be a function of the evolutionary 
distance with: 2 ;( 0 ) = 0 because at no distance there are no opportunities 
for indels, and z{t) < 1 for all t. The latter property can be seen by treating 



one of the two sequences as if it were the ancestral sequence. This is per¬ 


missible because we have assumed that the indel process is time reversible. 
The probability of a residue surviving from the ancestral sequence to the 
descendant sequence is described by an exponential function with rate pa¬ 
rameter controlled by the rate of deletions. This probability remains > 0 for 
all values of the evolutionary distance, hence there is a non-zero probability 
of an ungapped position, and z{t) cannot equal 1. 

In a typical consistency proof, we consider sequences of ever increasing 
length. We note that indel models (e.g. the TKF91 model) imply a equilib¬ 
rium sequence length. Here we discuss statistical consistency by considering 
what happens as the number of loci increases without bound, but the equi¬ 
librium length of each locus is determined by the parameters of the indel 
model. Hence the total sequence length approaches infinity, while it is still 
appropriate to describe the sequence as being generated by the indel process. 

For the standard substitution models, we can consistently estimate the 
distance if the indel process has insertion and deletion rates of 0. In this case 
there are no gapped columns and z{t) = 0. In the more general case, if we 
only consider site patterns in which no gaps occur, the probability of a site 
pattern s for branch length t is Pr(s|t) = (1 — 2 ;(t)) Pp(s|t) where Pp(s|t) 
is the usual site pattern probability (when we have no missing data caused 
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by gaps), and (1 — z{t)) is the probability of not containing a gap. The 
multiplication of the probability of not containing a gap by the probability 
of the site pattern given the branch length is valid whenever the substitution 
and indel process are independent. We can see that calculating the proba¬ 
bility of each ungapped site pattern by using the fraction of ungapped sites 
that display the pattern will result in Pr'(s|t) because this will constitute 
dividing the probability of each pattern by (1 — z{t)). Thus the spectrum of 
ungapped pattern frequencies will converge to exactly the same frequencies 
of the patterns when there are no indel events. 

Thus, if the indel process and substitution process are independent, 
treating gaps as missing data will not cause statistical inconsistency of 
distance-based tree inference. Note that this result does not contradict the 


proof by Warnow (2012) that treating gaps as missing data can lead to 


inconsistency in maximum likelihood. Her proof focuses on the maximum 
likelihood criterion and is restricted to the case in which internal branch 
lengths for the substitution process are equal to 0 (there are no substitution 
events). Internal branch lengths of 0 lead to inconsistency without the com¬ 
plication of an indel process. Our result applies to cases in which the tree 
method is capable of consistently estimating the tree if there are no indels. 
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Cases in which indel processes and snbstitntion process are 
not independent 


If the occurrence of an indel affects the probability of a snbstitntion, then the 
previous argument does not hold. In fact, we cannot use the argument above 
under any violation of the independence assumption. For example, if some 
subset of sites is constrained by evolution and thus free of both substitutions 
and indels, then it is possible for the gaps-as-missing-data convention to lead 
us to the wrong tree. In such cases, the gapped sites are a biased sample with 


respect to the substitution process. See Rome et al. (2013) for a discnssion 
of other contexts in which non-random patterns of missing data perturb 
phylogenetic estimation. Specifically, if the distribution of missing sites is 
not independent of the evolutionary rates at those sites this bias can lead to 


problems in phylogenetic reconstruction (Grievink et al., 2013; Rome et al. 


2013). 


Paired invariants model 


Consider the case of sequences being generated under the TKF91 (Thorne 


et al., 1991) indel model and the Jukes-Cantor (JC) pukes and Cantm 


1969) substitution model, but with invariant sites. In particular, let the 


“paired invariant sites” model refer to the case in which some fraction of 
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sites are free from both indels and substitutions and the other parts of the 
sequence are described by the TKF91 and JC models. In terms of the 
formalism of the TKF91 model, this would require that the each invariant 
site which is followed by a region that is free to vary is considered to have a 
new “immortal link” to the right of the invariant site. We consider the case 
in which alignment is known without error. 

Let pinv denote the expected proportion sites in a sequence which are 
invariant with respect to indels and substitutions. In the TKF model single 
nucleotide insertions and deletions can occur at any site in the alignment 


(Thorne et ah, 1991). Under the TKF model, at equilibrium the expected 
rate of insertions per locus is equal to the expected rate of deletions per 
locus. The TKF model is usually described with insertion rates per link and 
deletion rates per link. In that parameterization the insertion and deletion 
rates can differ. We call the deletion rate scaled per nucleotide 9. 

When computing a pairwise distance, the gaps-as-missing-data correc¬ 
tion removes sites in which either sequence has a gap from consideration. 
The expected length of a locus under the paired invariants model will be 
denoted N. This will be a function of the expected length of each block 
of variable sites, which is a function of the insertion rate relative to the 
deletion rate. Our argument applies to any insertion rate which leads to 
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a non-infinite equilibrium sequence length. So we phrase the argument in 


terms of the per-locus expected length and do not use the insertion rate 
parameter explicitly in our argument. 

Under the TKF91 model, each block of variable sites is expected to fol¬ 
low a geometric distribution with a parameter that depends on the ratio 
of the per-link insertion and deletion rates. Because sites with an insertion 
and then a deletion are typically culled from an alignment, we consider a 
pairwise alignment length to be the length of the correct alignment after all 
positions with gaps in both members of the pair are removed. Even though 
the expected number of nucleotides in each sequence does not change, the 
insertion of new positions and deletion of sites means that the pairwise align¬ 
ment length grows as a function of the evolutionary distance. In the paired 
invariants model, let a{t,6,pinv) denote the expected length of a pairwise 
alignment of two sequences separated by path length, t. Then: 


fl(0, 0, Pinv) 

= N 

(1) 

lim pinv) 

t^oo 

= lV(pinv + 2(1 -Pinv)) 

(2) 


where Npinv is the number of invariable columns in the alignment. N{1 — 
Pirn) columns are expected to be in the ancestor but deleted along the path 
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to the descendant. Because the process started at equilibrium, we expect 
them to be replaced by A^(l — Pmv) inserted sites. 

For each site that is free to vary in the ancestor, the probability that it 
survives to the descendant is using the exponential distribution. We 
refer to columns where there is a nucleotide in both the ancestor and the 
descendant as “ungapped columns”. The expected number of ungapped 
columns is 

b{t, 6,Pmv) = N (pinv + (1 - Pinv)e“®*^ 

Note that limt^oo &,Pinv) = 

The expected proportion of residues in a sequence which are free to vary 
remains constant at 1—Pinv as branch length approaches infinity. However, if 
we consider only ungapped columns in the true alignment of two sequences, 
we see that the proportion of these sites which are variable approaches 0 as 
deletions continue to reduce the number of aligned columns among the class 
of variable sites. The expected proportion of ungapped columns that are 
free to vary is; 

Pr(variable | ungapped) = 


iV(l-pinv)e-^* 
bit, 9, Pinv) 

(1 - Pinv)e~^* 

Pinv + (1 -Pinv)e“®* 
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Figure 2: Properties of the paired invariants model with pinv = 0.2 and 9 = 
0.1. A. The proportion of aligned sites which are free to vary as a function 



t t 


This function is plotted in Fig. for the case when = 0.2 and 9 = 0.1. 

Recall that under the Jukes-Cantor model the probability of a site hav¬ 
ing a different nucleotide from its ancestor across a path of length t is 
I ^1 — For the Jukes-Cantor model with invariant sites the prob¬ 

ability of a difference, conditional on a site being a member of the variable 
class is: 


Pr(difference | ungapped, variable) = 



it 

3(1-Pinv) 


V 


(4) 


The only difference between this formula and the Jukes-Cantor transition 
probability is the inclusion of a 1 — pinv factor to increase the rate of sub¬ 
stitution for the variable sites. This is included to adhere to the common 
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convention that the mean rate of substitutions is equal to 1.0 per site. 

For a pair of sequences, the probability of seeing a different state at 
a randomly chosen, ungapped, variable site (Eqn. is a monotonically 
increasing function of t. However, the proportion of ungapped sites which 
are variable decreases, as was shown in Eqn. (©• The expected pairwise 
distance for the paired invariants model when measured as the expected 
proportion of ungapped positions that differ between the tips is: 

E[p] = Pr(difference | ungapped, variable) Pr(variable | ungapped) 

_ 4f 

_ 3(1 - pinv)e~^*(l - e 3(i-Pinv))) 

4(pinv + (1 -Pinv)e“^0 

This expected p-distance is shown in figure [^. Note that it is not a mono¬ 
tonically increasing function. 

Gaps-as-missing distance correction 

Under a gaps-as-missing analysis, only the ungapped columns are relevant 
in distance calculations. Thus, the expected p-distance shown in Eqn. ([^ 
fills the role of g{t) in the discussion of our proofs about the consistency 
of distance-based tree estimation. Note that the substitution model for the 
paired invariant sites model is simply the Jukes-Cantor substitution model 
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Figure 3: The transformed evolutionary distance t values as a function of 

true evolutionary distance t (Eqn. 1^. 

2 . 0 - 



t 


with invariant sites. If we assume that we know the (correct) proportion of 
invariant residues in the generating process, then the distance correction for 
this model is: 


/ Hp) 


3(1 Pinv) , 

-4- 


_ Ap \ 

3(1 -Pinv)y ’ 


( 6 ) 


We can combine Eqn. ([^ and Q to express the transformed evolutionary 
distances t as a function of the true evolutionary distance, t: 


t 


3(1 Pinv) , 

-4- 


3(1 Pinv) 1 
-4- 


PinvH“(l Pinv)^ 

3(1 - Pinv) 

V / 



-4t 

Pinv T (1 Pinv)6 


( 7 ) 


This function is shown in Figure 
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Clearly the function is not linear; indeed it is not monotonically increas¬ 
ing. In fact, the function is not linear even at small path lengths. The first 
and second derivatives of the distance correction with respect to t (see Ap¬ 
pendix B) are somewhat complicated. However, when evaluated at t = 0, 
the first derivative is 1 and the second derivative of the expected value of 
the distance correction is —2pi^^6. Thus, the gaps-as-missing-data approach 
coupled with the correct substitution model results in a concave distance 
correction function whenever both pinv > 0 and 9 > 0. Lemma of Ap¬ 
pendix A states that this will lead to statistically inconsistent estimation of 
the tree topology for some tree shapes. 


Conclusions 


We have extended the work of Susko et ah (2004) by proving that there is 


a tree shape which will lead to the positively misleading estimation of an 
incorrect tree topology when the distance correction function is convex. We 
have also proven that the commonly applied gaps-as-missing-data approach 
will not lead to statistical inconsistency of distance estimates if the indel and 
substitution processes are independent. However, sequence evolution follows 
the paired invariants model, the deviation from independence is sufficient to 
lead to inconsistency of the distance estimates and the tree topology. 
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Obviously, the paired invariants model with a Jukes-Cantor substitution 
process is an extremely simple model which does not accurately depict the 
evolution of real sequences. Nevertheless, the paired invariants model encap¬ 
sulates a simple idea that has been at the core of thinking about molecular 


evolution ever since Kimura (1968): constant sites probably are constrained 
because they play an important functional role. It seems entirely plausi¬ 
ble that the subset of functionally important sites in the genome would be 
prevented from experiencing fixation of indels or substitutions. Thus it is 
troubling that adding this idea to the simplest possible substitution model 
is sufficient to lead to inconsistency of phylogenetic inference. 

One obvious solution would be to rely on distance corrections which do 
not treat gaps as missing data. Another option may be using multiple val¬ 
ues of pinv to correct for the fact that the proportion of ungapped positions 
which correspond to constrained sites is likely to be higher for comparisons 
over long evolutionary timespans. Both the proportion of gapped sites in 
the correct pairwise sequence alignment and the proportion of ungapped 
positions which are variable (shown in Figure [2]A) are monotonically chang¬ 
ing functions of the path length. This implies that it may be possible to 
devise some recipe for correcting distances that uses a pair-specific value of 
Pinv, and that this pair-specific pinv could be calculated from an observable 
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property of an alignment. Such a procedure might rescue distance-based 


from inconsistency when the data are generated by the paired invariants 
model. However, this form of inference would probably be sensitive to slight 
inadequacies of the model because accounting for rate heterogeneity when 
using pairwise data alone is notoriously difficult. Our results underscore that 
fact that phylogenetic inference is a problem that is so difficult that even 
subtle forms of ascertainment bias can lead to fundamental misbehavior of 
inference methods. 
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Appendix A 

Suppose that distances are generated on a tree by a model Mg and corrected 
assuming a model Mj. 

Theorem 1 Suppose that f{t) and g{t) (the functions for correcting and 
generating p-distances respectively) are analytic functions oft that are strictly 
increasing in some neighbourhood of 0, and satisfy /(O) = g{0) = 0. Let 
h{t) = t = f~^{g{t)) (the transformed evolutionary distances). Then pre¬ 
cisely one of the following conditions holds: 
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• The correction process f is equal to the generating function g up to a 
scalar multiple (i.e. f{t) = g{t/c) and so hit) = ct for all t G [0, p), for 
some constant c). In this case NJ will select the correct tree topology 
when applied to the transformed evolutionary distances; or 

• The correction process f is not equal to the generating function g up to 
a scalar multiple. In this case there exists a binary tree on four leaves 
with an associated set of strictly positive branch lengths for which NJ 
will select an incorrect tree topology when applied to the transformed 
distances. 


The proof of this result involves combining five lemmas; the first is stan¬ 


dard, the second is a formal statement of results from Susko et al. (2004), 


the third is new, and the fourth and hfth are technical lemmas. 


Lemma 2 fSaitou and Nei (lOSl) p.413] NJ applied to distance data on 
four taxa (A, B,C, D) returns the quartet tree AB\CD if 
dAB + dcD < niin{(i^c' + dsD, dAD + dsc}- 


Lemma 3 Suppose the transformed distance function h{t) is strictly con¬ 
cave and increasing on the interval [A,2A] for some A > 0. For any a > 0 
sufficiently small, distances on Felsenstein-Zone tree of Fig. \^A ) that are 
transformed by h have the property that NJ will estimate the incorrect tree 
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topology (AD\BC). 


Proof of Lemma 

By Lemma[2j for any distance function d on four taxa i, j, k, I, NJ applied 
to d will return the quartet tree ij\kl when i,j minimizes the pairwise sum 
dij+dki- Let us now put dij = h{tij) = Lj (i.e. the transformed evolutionary 
distances). Consider the three pairwise sums: 

(51) d^B + dcD = 2/i(A + cr); 

(52) dAc + dsD = 2/i(A + 2a); 

(53) dAD dsc — h{3o') h(2X O')] 

Since h is strictly increasing on [A, 2A], the expression (S2) is always greater 
than (SI) for any cr > 0. Thus it suffices to show that case (S3), which 
corresponds to NJ returning the tree AD\BC, is less that (SI) for sufficiently 
small CT > 0. To this end, note that since h is strictly concave on [A, 2A] we 
have: h{2\) < 2h{X), so if we let 

q{x) := 2h{X + x) — h{2X + x) — h{2>x) 

then q{0) = 2h{X) — h{2X) — h{0) > 0 (recall h{0) = 0). Since h is continuous 
(by virtue of being analytic) q is too, so it follows that for any sufficiently 
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small (but strictly positive) value of a we have q{a) > 0. Because q{a) 
equals the quantity described by (SI) minus that described by (S3), when 
q{a) > 0 then NJ will prefer tree AD\BC over the true tree AB\CD. □ 

Lemma 4 Suppose the transformed distance function t = h{t) is strictly 
convex and increasing on the interval [A, 2A] for some A > 0. For any a > 0 
sufficiently small, distances on the ‘twisted’ Farris-Zone tree of Fig. SB) 
that are transformed by h have the property that neighbor-joining will esti¬ 
mate the incorrect tree topology (AD\BC). 

Proof of Lemma^ For the ‘twisted’ Farris-Zone tree of Fig. [^B) con¬ 
sider the three pairwise sums: 

(51) dji,B + dQ£) = h{2\ -|- cr) -|- /i(3cj); 

(52) dj^c F djsD — h{X F 4cj ) F h[X -\- 2cj); 

(53) dAD F dsc =‘^h{X + 3a). 

Now, if h is strictly convex on [A, 2A] and if x,y G [A, 2A] then: 
h < \[Hx) F h{y)]. 
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Applying this with x = A + 4(T and y = X + 2a, where 0 < a < A/4, gives: 


/i(A + 3(7) < 2 [h(A + 4(7) + /i(A + 2a )], 
which gives (S3)<(S2). 

Again by convexity, h{2X) > 2h{X), so if we let 

q{x) := h{2X + (7) + h{3a) — 2h{X + 3(7). 

then q{0) = h{2X) — 2h{X) + /i(0) > 0. By a similar continuity argument 
as in the concave case, there exists a > 0 so that q{a) > 0. Because q{a) 
equals the quantity described (SI) minus that described by (S3), conditions 
for which q{a) > 0 are conditions for which NJ will again prefer tree AD\BC 
over the true tree AB\CD. 

□ 

Lemma 5 Under the assumptions on f and g in Theorem the trans¬ 
formed distance function h is a strictly increasing analytic function of t on 
[0, p) for some p > 0. 

Proof of Lemma The proof that h is analytic is straightforward, since 
analytic functions (in particular / and g) are closed under composition, and 
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also under functional inverse (providing their derivative is non-zero, as it is 
here). To see that h is increasing, at least close to 0, note that, by elementary 
differential calculus, we have: 


dt 


h{t) 


9'{t) 

f'if-Hgm' 


( 8 ) 


By assumption, / and g are both increasing in some neighbourhood of 0, and 
since f~^{g{0)) = /“^(O) = 0, there exists p > 0 for which the numerator 
and denominator of Inequality Q are both strictly positive for all t G [0, p). 
□ 

Lemma 6 Suppose H{t) is a real-valued function that is analytic in [0,/9) 
for some p> 0, and that satisfies H{0) = 0. If H{t) ^ ct on [0,p) for some 
constant c, then there exists some value s > 0 so that H(t) is either strictly 
concave on the interval [s/2, s] or strictly convex on the interval [s/2, s]. 

Proof of Lemma\^ 

If H”{0) > 0 then since H” is continuous at 0, there is a value s G [0, p) 
so that H”{t) > 0 for all t G [0,s] and so H is strictly convex on [s/2,s]. 
Similarly, if H"(0) < 0 then H is strictly concave on [s/2, s] for some s > 0. 
Suppose that H"{t) = 0. Then either (i) there exists a smallest A: > 2 for 
which / 0 (call this value fci) or (ii) H^{0) = 0 for all k > 2. In 


27 



Case (i), suppose first that a := > 0. A Taylor series expansion 

of H about 0 gives H{t) = + • • • where the remaining terms are of 

order and higher. Thus, for a sufficiently small u G {0,p), we have 
H”{t) = + (terms of order and higher) so > 0 

for all t G (0, v). In particular, for any strictly positive value of s less than u 
we have H"{t) > 0 for all t G [s/2, s]. Thus, as before, H" is strictly convex 
on [s/2, s]. A similar argument (for strict concavity) applies if < 0. 

In Case (ii) the Taylor expansion of H{t) on [0,p) centered on 0, shows 
that H"{t) = 0 for all t G [0,/?). By integrating (twice) it follows that 
H{t) = ct + H{0) for all t G [0, p), for some constant c. Since H{0) = 0, this 
gives H(t) = ct, as claimed. □ 

Proof of Theorem [^' By Lemma [^ h and analytic and increasing in 
[0, p), so by Lemma 1^ if we take H{t) = h{t) then if h is not linear, it is 
either strictly concave or strictly convex on an interval of the form [s/2,s] 
for some s G (0,/9). Theorem now follows from Lemmas and □ 
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Appendix B 


The first and second derivatives of the expected corrected distance (equation 
with respect to the path length t are: 

it 

C[t\ = 

d{t] = 

_ 4 - 3(e®* - d[t])pl^^9 + pinv [4 + 30] - 4 - 3d[t]9) 

Ot 4(1 c[t]pinv T d[t]pinv) (1 T Pinv') 

V[t\ = 

w[t] = 
x[t] = 
y[t] = 

m = Pinv - 1 

u[t] = —16c[t]m^ + + 9u[t]m^Pinv0^ — 

+x[t]pf^^ (4 - SmOf + 16?/[t]pinv (2 + 30 + 3pfjj^0 - 3pinv[l + 20]) 
-d[t]m (3pL( 8 - 30)0 + 9pL02 + (4 + 30)^ - 3pinv[16 + 160 + 30^]) 

dH _ _ PmvUjt] _ 

dt^ 12m (1 - c[t]pinv + d[t]Pinv)^ (1 + e^^Pinv - Pinvf 

These were calculated using the Mathematica notebook included as part of 
the supplementary materials. 
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